Uncertainty estimation with a finite dataset in the assessment of classification models
Authors
Abstract
To translate genomic classifiers into clinical practice successfully, it is essential to obtain reliable and reproducible measurements of classifier performance. A point estimate of classifier performance has to be accompanied by a measure of its uncertainty. In general, this uncertainty arises from both the finite size of the training set and the finite size of the testing set. The training variability is a measure of classifier stability and is particularly important when the training sample size is small. Methods have been developed for estimating such variability for the performance metric AUC (area under the ROC curve) under two paradigms: a smoothed cross-validation paradigm and an independent validation paradigm. The methodology is demonstrated on three clinical microarray datasets from the MicroArray Quality Control consortium phase two project (MAQC-II): breast cancer, multiple myeloma, and neuroblastoma. The results show that classifier performance is associated with large variability and that the estimated performance may change dramatically across datasets. Moreover, the training variability is found to be of the same order as the testing variability for the datasets and models considered. In conclusion, the feasibility of quantifying both training and testing variability of classifier performance is demonstrated on finite real-world datasets. The large variability of the performance estimates shows that patient sample size is still the bottleneck of the microarray problem and that the training variability is not negligible. Published by Elsevier B.V.
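The distinction the abstract draws between testing and training variability of the AUC can be illustrated with a small simulation. The sketch below is not the paper's smoothed cross-validation or independent validation estimator; it swaps in a plain bootstrap as a stand-in, and the toy mean-difference classifier, the synthetic Gaussian data, and all function names (`auc`, `resample`, `variability`, `simulate`) are assumptions made for illustration only.

```python
import random
from statistics import mean, pvariance


def auc(pos, neg):
    """Empirical AUC (Mann-Whitney): P(positive score > negative score), ties count 1/2."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def resample(items, rng):
    """Draw a bootstrap sample (with replacement) of the same size."""
    return [rng.choice(items) for _ in items]


def train_mean_difference(pos, neg):
    """Toy linear classifier (an assumption): weights = difference of class means."""
    dim = len(pos[0])
    return [mean(x[j] for x in pos) - mean(x[j] for x in neg) for j in range(dim)]


def score(w, xs):
    """Linear score w . x for every sample in xs."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in xs]


def variability(train_pos, train_neg, test_pos, test_neg, n_boot=200, seed=0):
    """Return (AUC point estimate, bootstrap test variance, bootstrap training variance)."""
    rng = random.Random(seed)
    w = train_mean_difference(train_pos, train_neg)
    point = auc(score(w, test_pos), score(w, test_neg))

    # Testing variability: keep the trained classifier fixed, resample the test set.
    test_aucs = []
    for _ in range(n_boot):
        tp, tn = resample(test_pos, rng), resample(test_neg, rng)
        test_aucs.append(auc(score(w, tp), score(w, tn)))

    # Training variability: retrain on resampled training sets, keep the test set fixed.
    train_aucs = []
    for _ in range(n_boot):
        wb = train_mean_difference(resample(train_pos, rng), resample(train_neg, rng))
        train_aucs.append(auc(score(wb, test_pos), score(wb, test_neg)))

    return point, pvariance(test_aucs), pvariance(train_aucs)


def simulate(n, shift, rng, dim=5):
    """Synthetic samples (an assumption): independent unit-variance Gaussian features."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(1)
train_pos, train_neg = simulate(15, 0.5, rng), simulate(15, 0.0, rng)
test_pos, test_neg = simulate(30, 0.5, rng), simulate(30, 0.0, rng)

point, var_test, var_train = variability(train_pos, train_neg, test_pos, test_neg)
print(f"AUC = {point:.3f}, test variance = {var_test:.5f}, training variance = {var_train:.5f}")
```

With a small training set, the two variance components in a run like this can be compared directly, which is the kind of comparison the abstract reports for the MAQC-II datasets; the numbers from this toy setup say nothing about those datasets.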
Similar resources
Soft Computing Methods based on Fuzzy, Evolutionary and Swarm Intelligence for Analysis of Digital Mammography Images for Diagnosis of Breast Tumors
Soft computing models based on intelligent fuzzy systems have the capability of managing uncertainty in the image-based practices of disease. Analysis of breast tumors and their classification is critical for early diagnosis of breast cancer, a common cancer with a high mortality rate among women all around the world. Soft computing models based on fuzzy and evolutionary algorithms play...
Asthma Control Level Assessment by Moving from the Current Reactive Care Models into a Preventive Approach based on Fuzzy Clustering and Classification Algorithms
Background and Aim: Asthma is a common, chronic disease of the respiratory tract. The best way to treat asthma is to control it. Experts in this field suggest continuous monitoring of asthma symptoms and adjustment of the self-care plan, together with a preventive treatment program, to achieve the desired control over asthma. Presenting these plans by the physician is set based on the control level in ...
Conditional Random Fields for Airborne Lidar Point Cloud Classification in Urban Area
Over the past decades, urban growth has been known as a worldwide phenomenon that includes widening process and expanding pattern. While the cities are changing rapidly, their quantitative analysis as well as decision making in urban planning can benefit from two-dimensional (2D) and three-dimensional (3D) digital models. The recent developments in imaging and non-imaging sensor technologies, s...
A Framework for Optimal Attribute Evaluation and Selection in Hesitant Fuzzy Environment Based on Enhanced Ordered Weighted Entropy Approach for Medical Dataset
Background: In this paper, a generic hesitant fuzzy set (HFS) model for clustering various ECG beats according to weights of attributes is proposed. A comprehensive review of electrocardiogram signal classification and segmentation methodologies indicates that algorithms able to effectively handle the nonstationarity and uncertainty of the signals should be used for ECG analysis. Ex...
Inflation and Inflation Uncertainty in Iran: An Application of GARCH-in-Mean Model with FIML Method of Estimation
This paper investigates the relationship between inflation and inflation uncertainty in the Iranian economy for the period 1990-2009 using monthly data. The results of a two-step procedure, such as the Granger causality test, which uses generated variables from the first stage as regressors in the second stage, suggest a positive relation between the mean and the variance of inflation. However...
Interval network data envelopment analysis model for classification of investment companies in the presence of uncertain data
The main purpose of this paper is to propose an approach for performance measurement, classification, and ranking of investment companies (ICs) that considers internal structure and uncertainty. To reach this goal, the interval network data envelopment analysis (INDEA) models are extended. This model is capable of modeling two-stage efficiency with intermediate measures i...
Journal title:
- Computational Statistics & Data Analysis
Volume 56, issue
Pages -
Publication date 2012